Data Partitioning for Parallel Entity Matching

Authors

  • Toralf Kirsten
  • Lars Kolb
  • Michael Hartung
  • Anika Groß
  • Hanna Köpcke
  • Erhard Rahm
Abstract

Entity matching is an important and difficult step in integrating web data. To reduce the typically high execution time of matching, we investigate how entity matching can be performed in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be executed independently. One of our strategies supports both blocking, to reduce the search space for matching, and parallel matching, to improve efficiency. Special attention is given to the number and size of data partitions, as they affect the overall communication overhead and the memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies on real-world product data from different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
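
The abstract's idea of partitioning the input and generating independently executable match tasks can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `partition`, `match_tasks`, and `run_task`, the fixed-size chunking, and the `similar` predicate are all assumptions made for illustration.

```python
from itertools import combinations, combinations_with_replacement, product


def partition(entities, max_size):
    """Split a list of entities into consecutive chunks of at most max_size.

    Hypothetical helper: fixed-size chunking is an assumed strategy.
    """
    if max_size < 1:
        raise ValueError("max_size must be positive")
    return [entities[i:i + max_size] for i in range(0, len(entities), max_size)]


def match_tasks(partitions):
    """Return index pairs (i, j) with i <= j; each pair is one match task.

    Tasks share no state, so they could be run on different nodes.
    """
    return list(combinations_with_replacement(range(len(partitions)), 2))


def run_task(partitions, task, similar):
    """Compare the entities of one task and return the matching pairs.

    `similar` is an assumed user-supplied predicate on two entities.
    """
    i, j = task
    if i == j:
        # Within a single partition, compare each unordered pair once.
        candidates = combinations(partitions[i], 2)
    else:
        # Across two partitions, compare every cross pair.
        candidates = product(partitions[i], partitions[j])
    return [(a, b) for a, b in candidates if similar(a, b)]
```

Under these assumptions, p partitions yield p(p+1)/2 match tasks. Smaller partitions lower the memory needed per task but increase the number of tasks and the data that must be shipped to them, which is the trade-off the abstract points to.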

Similar resources

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Partitioning Strategy Selection for In-Memory Graph Pattern Matching on Multiprocessor Systems

Pattern matching on large graphs is the foundation for a variety of application domains. The continuously increasing size of the underlying graphs requires highly parallel in-memory graph processing engines that need to consider non-uniform memory access (NUMA) and concurrency issues to scale up on modern multiprocessor systems. To tackle these aspects, a fine-grained graph partitioning becomes...

Diverse information integration and visualization

This paper presents and explores a technique for visually integrating and exploring diverse information. Researchers and analysts seeking knowledge and understanding of complex systems have increasing access to related, but diverse, data. These data provide an opportunity to consider entities of interest from multiple informational perspectives not available from any single, data or information...

Ontology-Driven Data Partitioning and Recovery for Flexible Query Answering

Flexible Query Answering helps users find relevant information to their queries even if no exactly matching answers can be found in a database system. However, relaxing query conditions at runtime is inherently slow and does not scale as the data set grows. In this paper we propose a method to partition the data by using an ontology that semantically guides the query relaxation. Moreover, if se...

A High Performance Parallel IP Lookup Technique Using Distributed Memory Organization and ISCB-Tree Data Structure

The IP Lookup Process is a key bottleneck in routing due to the increase in routing table size, increasing traffic, and migration to IPv6 addresses. IP address lookup involves computing the Longest Prefix Match (LPM), for which existing solutions such as BSD Radix Tries scale poorly when traffic in the router increases or when employed for IPv6 address lookups. In this paper, we describe a ...

Journal:
  • CoRR

Volume abs/1006.5309

Publication date 2010